Had a great time again, thanks Eric. And I got on the global leaderboard once, so I can die in peace now :)
I've been participating in 🎅Advent of Code🎅 since 2018 and enjoying it a great deal. Since AoC has been running since 2015, a sizable amount of data has been generated. Let's see what we can learn, starting with the number of stars awarded each season.
# hide
import numpy as np
import altair as alt
import pandas as pd
import database as db

alt.data_transformers.disable_max_rows()

def sfmono():
    font = "SF Mono"
    font = "Lato"
    font = "Arial"
    fontsize = 16
    return {
        'height': 300,
        'width': 450,
        "config": {
            "title": {'font': font, 'titleFontSize': 100},
            "axis": {
                "labelFont": font,
                "labelFontSize": fontsize,
                "titleFont": font,
                "titleFontSize": fontsize,
            },
            "header": {
                "labelFont": font,
                "labelFontSize": fontsize,
                "titleFont": font,
                "titleFontSize": fontsize,
            },
            "legend": {
                "labelFont": font,
                "labelFontSize": fontsize,
                "titleFont": font,
                "titleFontSize": fontsize,
            },
        }
    }

alt.themes.register('sfmono', sfmono)
alt.themes.enable('sfmono')

conn = db.open_db('aoc.db')

# no points for 2020 day 1 and 2018 day 6
db.do(conn, "UPDATE scores SET points = 101 - position")
db.do(conn, "UPDATE scores SET points = null where year = 2020 and day = 1")
db.do(conn, "UPDATE scores SET points = null where year = 2018 and day = 6")
# hide
sql = """
    SELECT year,
           ROUND((SUM(both) * 2 + SUM(first)) * 1.0 / 100000, 2) as million_stars
    FROM finishers
    GROUP BY 1
"""
# -- SELECT * FROM scores LIMIT 5
data = db.do(conn, sql)
df = pd.DataFrame([{'year': year, 'image': 'star'}
                   for year, count in data
                   for _ in range(int(count))])
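The cell that actually draws the star pictogram is hidden; here is a minimal sketch of my own (not the original chart) of how the yearly totals could be plotted, assuming `db.do` returns `(year, million_stars)` tuples, as the list comprehension above suggests. The column names in the sketch are my own.

```python
# Hedged sketch (my addition, not the original hidden chart): a simple bar chart
# of the yearly star totals, assuming `data` is a list of (year, million_stars)
# tuples returned by the query above.
totals = pd.DataFrame(data, columns=['year', 'million_stars'])

alt.Chart(totals).mark_bar(color='darkgreen').encode(
    x=alt.X('year:O', title='Year'),
    y=alt.Y('million_stars:Q', title='Stars awarded (as computed above)'),
)
```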
In total more than 14M stars have now been awarded! After a ‘rough’ 2016, AoC has been growing steadily, with 2020 giving a (Covid?) 100% boost. And 2022 has only just finished; many people will earn stars in the days to come.
Edit: as BBQspaceflight indicated on the AoC Reddit, 2016 was probably not a rough year; rather, many people solved 2015 at a later time (e.g. they participated in 2018 and afterwards did 2015).
# hide_input
df = db.do_df(conn, 'SELECT * FROM scores WHERE star = 2 ORDER BY time')
df['puzzle'] = df['year'].astype(str) + ' - ' + df['day'].astype(str)
df['time'] /= 60

stripplot = alt.Chart(df, width=40).mark_circle(size=8).encode(
    x=alt.X(
        'jitter:Q',
        title=None,
        axis=alt.Axis(values=[0], ticks=True, grid=False, labels=False),
        scale=alt.Scale(),
    ),
    y=alt.Y('time:Q', title=None, axis=alt.Axis(tickCount=2, values=[0, 60])),
    # color=alt.Color('puzzle:N', legend=None),
    column=alt.Column(
        'day:N',
        header=alt.Header(
            labelAngle=-90,
            titleOrient='bottom',
            labelOrient='bottom',
            labelAlign='right',
            labelPadding=3,
        ),
    ),
    row=alt.Row('year:N', title=None, spacing=10),
).transform_calculate(
    # Generate Gaussian jitter with a Box-Muller transform
    jitter='sqrt(-2*log(random()))*cos(2*PI*random())'
).configure_facet(
    spacing=0
).configure_view(
    stroke=None
).properties(
    width=18,
    height=120
).configure_mark(
    color='darkgreen'
)
stripplot
For a better plot, look at Maurits vd Schee. I only plotted the second star (the full day). It’s a familiar sight: in general the times are below the hour mark, with a couple going higher. After 2020 the completion times are more compressed. This year’s most difficult puzzles were 16 (Proboscidea Volcanium, tunnels with valves), 19 (Not Enough Minerals, collect geodes) and 22 (it’s a cube?)
Low completion times can be a result of two factors:
* The puzzles were easier
* The participants were better / more competitive
One way of investigating the difficulty of a year is by analyzing the completion rate: how many people got all the stars, compared to the people that got only the first star of day 25. The latter did make it to day 25, so they put in a considerable amount of effort, but couldn’t finish all the puzzles.
I didn’t want to take all participants from day 1, since that number drops quickly during the first couple of days.
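To make the metric concrete, a tiny worked example with made-up numbers (not real AoC data):

```python
# Hypothetical numbers, only to illustrate the completion-rate definition:
# people with both stars on day 25 vs. people with only the first star.
both_stars = 9000      # finished everything, including day 25 part 2
only_first = 1000      # reached day 25 but did not get all stars
completion_rate = both_stars * 100 / (both_stars + only_first)  # -> 90.0 (%)
```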
# hide
from pprint import pprint

sql = """
    SELECT f1.year,
           f1.both,
           f1.first as only_first,
           ROUND(f1.both * 100 / (f1.first + f1.both), 0) as perc_day_25_both
    FROM finishers f1
    WHERE day = 25
"""
# -- SELECT * FROM scores LIMIT 5
df = db.do_df(conn, sql)
df['x1'] = df['both'].cumsum()
df['x0'] = df['x1'].shift(fill_value=0)
# hide_input
alt.Chart(df).mark_rect().encode(
    alt.X('x0', title='# participants finishing all puzzles',
          axis=alt.Axis(labelFontSize=16, format='.2s')),
    alt.X2('x1'),
    alt.Y('perc_day_25_both:Q',
          title='% day 25 participants obtaining all stars',
          axis=alt.Axis(labelFontSize=10)),
    color=alt.Color('year:O', scale=alt.Scale(scheme='greens'))
)
In the chart above, each rectangle represents the people that solved all puzzles of a given year: its width is the number of finishers and its height the completion rate.
The completion rate was very high in 2016 and 2017 and lowest in 2018. In 2020 many people finished all puzzles, corresponding with more participants that year.
For 2022 the verdict is still out; in a couple of months people will have had time to finish and the completion rate will stabilize. I find it interesting that it’s already higher than 2021’s.
# hide_input
sql = """
WITH year_scores AS (
    SELECT year, user, SUM(points) as total
    FROM scores
    GROUP BY 1, 2
    ORDER BY 3 DESC
), started AS (
    SELECT year, both
    FROM finishers
    WHERE day = 25
    GROUP BY 1
)
SELECT y.year,
       COUNT(y.year) as 'participants with points',
       s.both as 'participants finishing all puzzles'
FROM year_scores y
JOIN started s ON s.year = y.year
GROUP BY y.year
"""
df = db.do_df(conn, sql)
df['% Finishers that got points'] = (
    df['participants with points'] / df['participants finishing all puzzles'] * 100
).astype(int)

descriptives = alt.Chart(df).mark_bar().encode(
    alt.X(alt.repeat("column"), type='quantitative'),
    alt.Y('year:O', title=None)
).properties(
    width=100, height=200
).repeat(
    column=['participants finishing all puzzles',
            'participants with points',
            '% Finishers that got points']
).configure_mark(
    color='darkgreen'
)
descriptives
Again, note that 2022 is still very fresh.
We see that:
* The number of finishers (people getting all stars) peaked in 2020
* The number of people that got points on the leaderboard is slowly increasing, with 2015 also being very high. Why?
* The percentage of finishers getting points varies. Lower percentages could indicate how competitive a year was. Another indicator is the time it took to solve a puzzle.
# hide_input
sql = """
WITH totaltime AS (
    SELECT year, position, ROUND(SUM(time) * 1.0 / 3600, 1) as total
    FROM scores
    WHERE (position = 1 or position = 100) and star = 2
    GROUP BY 1, 2
)
SELECT t1.year, t1.position, t1.total as 'total_time (hours)'
FROM totaltime t1
GROUP BY 1, 2
"""
times = db.do_df(conn, sql)

chart = alt.Chart(times).mark_line(color='#00cc00').encode(
    x=alt.X('total_time (hours):Q', title='Total time to complete all 25 days'),
    y='year:N',
    detail='year:N'
)

# Add points for the #1 and #100 totals per year
points = chart.mark_point(
    size=100, opacity=1, filled=True
).encode(
    x='total_time (hours):Q',
    y='year:N',
    color=alt.Color('position:O', scale=alt.Scale(
        domain=['1', '100'],
        range=['#00cc00', '#009900']
    ))
)
chart + points
The fastest completion times add up to around 3 hours, which is amazing. Since nobody has ever finished #1 on every puzzle, this is a theoretical minimum.
The completion times of the #100 add up to a more ‘human’ amount. These times are still way below what a ‘normal’ participant spends on AoC. For example, I consider myself an enthusiast, but my completion times are normally about 2-3x those of the #100.
2022 was very similar to 2021.
# hide_input
df2 = df.merge(times.loc[times['position'] == 100])

points = alt.Chart(df2).mark_point(color='darkgreen').encode(
    x=alt.X('participants finishing all puzzles:Q', scale=alt.Scale(type='linear')),
    y=alt.Y('total_time (hours):Q', scale=alt.Scale(type='linear'),
            title='Total time #100')
)
text = points.mark_text(
    align='left',
    baseline='middle',
    dx=7
).encode(
    text='year'
)
points + text
# Here is where things get hazy: there seems to be some kind of relation between
# the % of participants that get points and the total time taken adding up all
# the days #100 LB. Just leaving this here for your imagination.
2022 will move further to the right given time. There does seem to be some correlation between the total time the #100 took and the number of participants finishing all puzzles, although I’m not sure it’s causal; other variables could be playing a role.
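To put a rough number on that impression, here is a quick sketch of my own (not part of the original analysis), using the merged frame `df2` from the cell above:

```python
# Hedged sketch: Pearson correlation between the number of finishers and the
# total time the #100 spot needed, using the columns plotted above.
corr = df2['participants finishing all puzzles'].corr(df2['total_time (hours)'])
print(f"Pearson r: {corr:.2f}")
```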
Getting leaderboard points is special (I made it once and got 39 points, mission complete!!!). There are people who do it consistently. Let’s give the top 30 some extra recognition 🎈
# hide_input
sql = """
WITH year_scores AS (
    SELECT year, user, SUM(points) as points, SUM(time) as time, COUNT(user) as counts
    FROM scores
    GROUP BY 1, 2
    ORDER BY 3 DESC
)
SELECT user,
       COUNT(user) as amount_seasons,
       CAST(SUM(ifnull(points, 0)) AS INT) as total_points,
       SUM(counts) as total_leaderboard_placements,
       CAST(AVG(counts) AS INT) as average_amount_lb_placements,
       CAST(AVG(points) AS INT) as avg_points_season
FROM year_scores
GROUP BY 1
ORDER BY 3 DESC
"""
df = db.do_df(conn, sql)

alt.Chart(df[:30]).mark_point(color='darkgreen').encode(
    alt.Y('user:N', sort=['total_points']),
    alt.X(alt.repeat("column"), type='quantitative',
          axis=alt.Axis(labelFontSize=15)),
).properties(
    width=175, height=450
).repeat(
    # alt.Y('user:N', sort=['total_points']),
    column=['total_points', 'total_leaderboard_placements'],
)
All the people on this top-30 list are amazing, but some awards to hand out:
* 🏆 Robert Xiao managed to get the most points and the most leaderboard placements overall
* 🏆 betaveros got the most points & leaderboard entries on average (ignoring the anonymous user here). betaveros also managed to get 50 entries in 2018, which was a one-time event
* 🏆 glguy for getting the highest score while getting points in all 8 seasons
Doing AoC once and getting LB points is nice, but it’s even nicer to do it twice, thrice, etc.
# hide_input
alt.Chart(df).mark_bar().encode(
    alt.X('amount_seasons:N', title='# seasons to get points'),
    alt.Y('count()', title='amount', scale=alt.Scale(type='log'))
).configure_mark(
    color='darkgreen'
)
Most of the people that get points manage to do it only once (note that the y-axis is logarithmic). Who has enough grit to get points in all 8 seasons?
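Before the year-by-year breakdown below, that question can also be answered directly from the per-user summary frame built two cells up; a minimal sketch, assuming `amount_seasons` counts the seasons in which a user appears on the leaderboard:

```python
# Hedged sketch: list the users that collected points in all 8 seasons,
# using the per-user summary `df` from the leaderboard query above.
df.loc[df['amount_seasons'] == 8, ['user', 'total_points']]
```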
Compared to last year we lost Daniel Rejment and Shane Mc Cormack, with glguy now topping the list. Coming back to our competitiveness discussion: how many points did they score together?
# hide_input
sql = """
WITH year_scores AS (
    SELECT year, user, SUM(points) as total
    FROM scores
    GROUP BY 1, 2
    ORDER BY 3 DESC
), 'toppers' AS (
    SELECT user
    FROM year_scores
    GROUP BY 1
    HAVING count(user) > 7
)
SELECT s.year, s.user, s.total as points
FROM toppers as t
JOIN year_scores as s ON s.user = t.user
"""
# -- SELECT * FROM scores LIMIT 5
df = db.do_df(conn, sql)
df['user'].value_counts()

alt.Chart(df).mark_bar(opacity=0.8).encode(
    x='year:N',
    y='sum(points)',
    color=alt.Color('user')
).properties(
    width=400, height=300
)
It varies, but 2020, 2021 and 2022 are lower. This could reflect:
* increased competitiveness over the years
* natural variation
* legends getting older 😊
All in all an amazing achievement! And I think there is a strong case for AoC having become more competitive after 2019.
Let’s finally turn to which puzzles were easiest or hardest.
# hide_input
sql = """
    SELECT year || '-' || day || '-' || star as puzzle,
           user,
           time as "time (seconds)"
    FROM scores
    WHERE position = 1
    ORDER BY 3
"""
# -- SELECT * FROM scores LIMIT 5
df = db.do_df(conn, sql)

sql = """
    SELECT year || '-' || day || '-' || star as puzzle,
           time as "lb full (seconds)"
    FROM scores
    WHERE position = 100
    GROUP BY 1
    ORDER BY 2
"""
# -- SELECT * FROM scores LIMIT 5
df_100 = db.do_df(conn, sql)

df[:5]
|   | puzzle   | user             | time (seconds) |
|---|----------|------------------|----------------|
| 0 | 2022-3-1 | ostwilkens       | 10             |
| 1 | 2022-4-1 | max-sixty        | 16             |
| 2 | 2019-1-1 | bluepichu        | 23             |
| 3 | 2018-1-1 | Tris Emmy Wilson | 26             |
| 4 | 2021-1-1 | betaveros        | 28             |
Last year we had a record of 23 seconds, but thanks to ChatGPT we have two new ‘easiest’ puzzles.
# hide_input
df_100[:5]
|   | puzzle   | lb full (seconds) |
|---|----------|-------------------|
| 0 | 2021-1-1 | 65                |
| 1 | 2022-1-1 | 76                |
| 2 | 2019-1-1 | 84                |
| 3 | 2018-1-1 | 92                |
| 4 | 2021-2-1 | 98                |
The leaderboard capped (i.e. the #100 completed the puzzle) after barely a minute for the first star of 2021’s day 1!
The longest 3 solve times were all in 2018! Shoutout to Simon Parent for solving 2 out of the top 3. This list mostly has puzzles that simply take a long time to code, with Beverage Bandits as a perfect example.
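The cell backing that claim isn’t shown above, but the slowest winning times can be pulled from the same `df` built earlier; a minimal sketch:

```python
# Hedged sketch: the three slowest #1 (winning) solve times,
# using the position = 1 frame `df` from the query above.
df.sort_values('time (seconds)', ascending=False).head(3)
```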
# hide_input
df_100 = df_100.loc[df_100['puzzle'].str.endswith('2'), :].copy()
df_100['lb full (seconds)'] = (df_100['lb full (seconds)'] / 60).astype(int)
df_100 = df_100.rename(columns={'lb full (seconds)': 'lb full (minutes)'})
df_100 = df_100.sort_values('lb full (minutes)', ascending=False)
df_100['title'] = 'Title'
df_100.iloc[:10, 2] = ['Medicine for Rudolph', 'Not Quite Lisp', 'Wizard Simulator 20XX',
                       'Radioisotope Thermoelectric Generators', 'Beverage Bandits',
                       'Slam Shuffle', 'Many-Worlds Interpretation',
                       'Experimental Emergency Teleportation', 'Grid Computing',
                       'Immune System Simulator 20XX']
df_100[:10]
|     | puzzle    | lb full (minutes) | title                                  |
|-----|-----------|-------------------|----------------------------------------|
| 399 | 2015-19-2 | 232               | Medicine for Rudolph                   |
| 398 | 2015-1-2  | 186               | Not Quite Lisp                         |
| 397 | 2015-22-2 | 183               | Wizard Simulator 20XX                  |
| 396 | 2016-11-2 | 164               | Radioisotope Thermoelectric Generators |
| 393 | 2018-15-2 | 143               | Beverage Bandits                       |
| 392 | 2019-22-2 | 123               | Slam Shuffle                           |
| 390 | 2019-18-2 | 117               | Many-Worlds Interpretation             |
| 388 | 2018-23-2 | 100               | Experimental Emergency Teleportation   |
| 387 | 2016-22-2 | 88                | Grid Computing                         |
| 386 | 2018-24-2 | 87                | Immune System Simulator 20XX           |
If we look at when the leaderboard capped, some different puzzles show up. I feel this list contains more algorithmic challenges (Slam Shuffle for example, but Medicine for Rudolph as well). Also, 2015 shows up in the top 3. If these puzzles were recycled in 2021, they would have been solved faster.
Overall, I feel 2018 is a strong contender for the most difficult year, with day 15 the most difficult puzzle in the history of Advent of Code!
Hope you enjoyed this analysis and see you back next year! 🎄⭐🎅